Enable PP and EP overlap for MoE #1721
base: main
Conversation
Force-pushed from 3a61b86 to 0f7a7c9
Running with: CUDA_LAUNCH_BLOCKING
Force-pushed from 0f7a7c9 to 6584aac
Force-pushed from a6e46c7 to 5810c54
Just landed pytorch/pytorch#162016, so once CI picks up the nightly the errors should be fixed.
Looks very cool! Left some comments and questions.
Also looking forward to benchmarking results with overlapping enabled vs. disabled. In particular, for the 16B model, we should be able to test out on 8 GPUs, assuming SAC is composable.
```diff
 [activation_checkpoint]
-mode = "selective"  # ["none", "selective", "full"]
+mode = "none"  # ["none", "selective", "full"]
```
Does it not support SAC?
AC/SAC are not supported since we split the backward into two parts.
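For context, here is a minimal sketch of one way a backward pass can be split into two parts: stop at a boundary activation, then resume later from the saved gradient. This is an illustrative reconstruction under that assumption, not the PR's actual code; the modules and the choice of boundary are made up:

```python
# Illustrative only: split one backward pass into two stages by detaching at a
# boundary activation and resuming from its gradient afterwards.
import torch
import torch.nn as nn

stage1 = nn.Linear(16, 16)  # hypothetical compute before the boundary
stage2 = nn.Linear(16, 16)  # hypothetical compute after the boundary

x = torch.randn(4, 16, requires_grad=True)
h = stage1(x)                                 # boundary activation
h_detached = h.detach().requires_grad_(True)  # cut the graph at the boundary
loss = stage2(h_detached).sum()

# Part 1: backward through the second half only, stopping at the boundary.
loss.backward()
grad_at_boundary = h_detached.grad

# Part 2: resume backward through the first half from the boundary gradient.
h.backward(grad_at_boundary)
assert x.grad is not None
```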
```diff
     mscale=0.70,
-    use_flex_attn=True,
-    attn_mask_type="block_causal",
+    use_flex_attn=False,
```
Is FlexAttention not supported? It sounds unrelated.
Force-pushed from 9e43a67 to 7cf98e4
Fixed one issue with FSDP last reshard not being called. Rest is mostly refactoring, changing some variables to be class variables so they can be used in pytorch/torchtitan#1721. Pull Request resolved: pytorch#165513. Approved by: https://github.com/fegin
Force-pushed from 7cf98e4 to c29fa82
Option 2 of #1682
These changes add a custom `overlap_callback` function to replace the OVERLAP_F_B action that is run during schedule execution. In the custom function, we write `run_forward()` and `run_backward()`. `run_backward()` is run on a separate thread so that we can have both forward and backward running together side by side (a rough sketch is shown further below). Looks like this:

(figure: forward and backward of different microbatches overlapping)

In order for these changes to work with Expert Parallel, we also need to add custom autograd functions to act as the boundary points at which we do communication. We added hooks before and after expert parallel dispatch and combine to signal boundary points, so our figure from before now turns into:

(figure: the same overlap with EP dispatch/combine boundary points marked)
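A minimal sketch of the thread-based overlap described above, assuming the schedule hands the callback one forward and one backward action per OVERLAP_F_B step. The function and argument names here are illustrative stand-ins, not the PR's actual API:

```python
# Hypothetical sketch: run the backward action on a worker thread while the
# forward action runs on the main thread, so their compute and EP
# communication can be interleaved.
import threading


def run_forward(action):
    # Stand-in for the schedule's forward work for one microbatch
    # (attention/MLP compute plus EP dispatch/combine).
    ...


def run_backward(action):
    # Stand-in for the backward work of an earlier microbatch.
    ...


def overlap_callback(forward_action, backward_action):
    bwd_thread = threading.Thread(target=run_backward, args=(backward_action,))
    bwd_thread.start()           # backward proceeds on a side thread
    run_forward(forward_action)  # forward proceeds on the main thread
    bwd_thread.join()            # rejoin before the next schedule action
```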
Now in each of these red blocks, we use a global coordinator. We need `threading.Barrier(2).wait()` so that the comm and compute from our forward and backward steps are scheduled in lock-step before continuing.
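A rough sketch of what such a barrier-guarded boundary point could look like as a custom autograd function. The class name `SyncBoundary`, the `mark_boundary` helper, and the exact placement are assumptions for illustration, not the PR's implementation:

```python
# Hypothetical sketch: a custom autograd Function that makes the forward
# thread and the backward thread rendezvous at an EP dispatch/combine
# boundary, so their comm and compute are scheduled in lock-step.
# Note: Barrier(2) only releases once both threads have reached wait().
import threading

import torch

_coordinator = threading.Barrier(2)  # one forward thread + one backward thread


class SyncBoundary(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        _coordinator.wait()  # block until the peer thread hits its boundary
        return x

    @staticmethod
    def backward(ctx, grad_output):
        _coordinator.wait()  # same rendezvous on the backward path
        return grad_output


def mark_boundary(x):
    # Would be inserted before/after expert parallel dispatch and combine.
    return SyncBoundary.apply(x)
```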
DSv3 16B run command:
Trace examples: